Image Compression

Compression to exact file sizes using machine learning.
Author

Anweshan Adhikari

Code
# packages
from PIL import Image
import numpy as np
from scipy import ndimage
from sklearn.ensemble import GradientBoostingRegressor
import pandas as pd
from sklearn.model_selection import train_test_split
from unittest.mock import patch
import io
import glob
import os
import time

I always get annoyed when a website rejects my image because it’s 3.2MB instead of the required “under 2MB”. Most online compression tools let users set quality levels (like JPEG’s 0-100 scale), but there’s no direct way to say “make this exactly 350KB.” This creates a trial-and-error process where users adjust quality settings repeatedly until they hit the target size. In this blog post, I’ll explore why exact-size compression is challenging and investigate two approaches: a traditional binary search method and an experimental machine learning technique that attempts to predict optimal compression parameters. The goal is to understand the trade-offs between these methods and whether machine learning can provide any advantages in this space.

Why “exact size” compression is difficult

To understand the challenge, we need to look at how compression actually works. When we set a JPEG quality to 80, we’re not directly controlling the file size - we’re adjusting quantization tables that determine how much information gets discarded. The relationship between quality settings and resulting file size is complex and depends on several factors:

Content-dependent compression efficiency: A simple logo compressed at quality 80 might be 20KB, while a detailed photograph at the same setting could be 200KB.

Quantization steps are discrete: Most compression algorithms use block-based processing (8x8 in JPEG), with each block encoded into variable-length bit streams. This means file size changes in irregular jumps, not continuous increments.

Metadata overhead: EXIF data and other metadata can add 10-40KB to file size with no visual impact.

Complex optimization space: The mapping between quality settings and file size is non-linear and varies by image content.
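
To make the content-dependence point concrete, here is a minimal sketch that encodes two synthetic images at the same quality and compares the sizes. The flat red square and the random-noise image are assumed stand-ins for a simple logo and a detailed photo, and `jpeg_size` is a hypothetical helper rather than part of the code used elsewhere in this post.

```python
import io
import random

from PIL import Image


def jpeg_size(img, quality):
    """Return the size in bytes of `img` encoded as JPEG at `quality`."""
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    return len(buf.getvalue())


# Synthetic 256x256 test images (assumed stand-ins, not real photos).
flat = Image.new("RGB", (256, 256), (200, 30, 30))
rng = random.Random(0)  # fixed seed so the noise image is reproducible
noise_bytes = bytes(rng.getrandbits(8) for _ in range(256 * 256 * 3))
noisy = Image.frombytes("RGB", (256, 256), noise_bytes)

flat_size = jpeg_size(flat, 80)
noisy_size = jpeg_size(noisy, 80)
print(f"flat image at quality 80:  {flat_size} bytes")
print(f"noisy image at quality 80: {noisy_size} bytes")
```

The noisy image comes out far larger at the same setting, because random detail leaves JPEG very little it can discard cheaply.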

Let’s formulate this mathematically. For an image \(I\) compressed with quality parameter \(q\), the resulting size \(S\) is:

\(S(I,q) = f(I,q) + M(I)\)

Where f is a non-linear function dependent on image content, and M(I) is metadata overhead. Our goal is to solve for q when S is our target size:

\(q = f^{-1}(S - M(I), I)\)

The problem is that \(f^{-1}\) doesn’t have a closed-form solution, and worse, it’s different for every image.
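
Without a closed form, the inverse has to be found numerically. Below is a minimal sketch of binary search over the quality value, assuming file size does not decrease as quality rises. `encode` is a hypothetical callable that maps a quality value to encoded bytes, and `largest_quality_under` is an illustrative name; this is not the `compress_to_target_size` function used later in the post, whose body is not shown here.

```python
from typing import Callable, Optional, Tuple


def largest_quality_under(
    encode: Callable[[int], bytes],
    target_bytes: int,
    q_min: int = 1,
    q_max: int = 95,
) -> Optional[Tuple[int, bytes]]:
    """Return (quality, data) for the highest quality that fits the target.

    Returns None when even q_min produces more than target_bytes.
    """
    best = None
    lo, hi = q_min, q_max
    while lo <= hi:
        mid = (lo + hi) // 2
        data = encode(mid)
        if len(data) <= target_bytes:
            best = (mid, data)  # fits: remember it and try a higher quality
            lo = mid + 1
        else:
            hi = mid - 1        # too big: try a lower quality
    return best


# Toy stand-in encoder: output grows by 50 bytes per quality step.
def toy_encode(quality: int) -> bytes:
    return b"x" * (1000 + 50 * quality)


result = largest_quality_under(toy_encode, target_bytes=3000)
print(result[0], len(result[1]))
```

With Pillow, `encode` could wrap `img.save(buf, format="JPEG", quality=q)` on an `io.BytesIO` buffer. Each probe costs one full encode, so a search over roughly 95 quality levels needs about seven encodes.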

A smarter approach: Machine learning prediction

I was curious whether machine learning could improve this process by predicting the optimal quality setting directly. Let me explain the concept:

What is a prediction model?

A prediction model in this context is an algorithm that learns the relationship between:

  1. Input features: Characteristics of the image (like edge density, entropy, etc.) and a desired target size
  2. Output: The optimal JPEG quality setting that would produce that target size

Essentially, we’re trying to approximate the inverse of the compression function. Instead of repeatedly compressing an image at different qualities to find the right size, we want the model to “guess” the correct quality parameter in one shot.
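
Two of the image features named above, entropy and edge density, can be sketched in plain Python. This is an illustrative version under my own assumptions (grey levels in the range 0-255, horizontal neighbours only, a hypothetical threshold of 32); the feature extraction actually used for the model is not shown in this post and may differ.

```python
import math
from collections import Counter


def shannon_entropy(pixels):
    """Entropy in bits per pixel of a flat sequence of grey levels."""
    total = len(pixels)
    if total == 0:
        return 0.0
    counts = Counter(pixels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def edge_density(rows, threshold=32):
    """Fraction of horizontally adjacent pixel pairs differing by more than `threshold`."""
    pairs = 0
    edges = 0
    for row in rows:
        for left, right in zip(row, row[1:]):
            pairs += 1
            if abs(left - right) > threshold:
                edges += 1
    return edges / pairs if pairs else 0.0


flat_patch = [[128] * 4 for _ in range(4)]
striped_patch = [[0, 255, 0, 255] for _ in range(4)]
print(edge_density(flat_patch), edge_density(striped_patch))
print(shannon_entropy([p for row in striped_patch for p in row]))
```

A flat patch scores zero on both measures, while a striped patch has an edge between every pair of neighbours, which matches the intuition that busier images are harder to compress.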

The process involves:

  1. Training: We collect examples of images compressed at various quality settings, extracting features and recording the resulting file sizes
  2. Learning: The model identifies patterns between image features, quality settings, and file sizes
  3. Prediction: When given a new image and target size, the model estimates the quality setting that would produce that size
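
The training step above can be sketched as a loop that records one row per image and quality setting. `build_training_rows`, `extract_features`, and `encode` are hypothetical names introduced for illustration, and the quality grid of 10 to 90 is an assumption; the post's own `train_size_predictor` is not shown and may collect its data differently.

```python
def build_training_rows(images, extract_features, encode, qualities=range(10, 100, 10)):
    """Return one dict per (image, quality) pair with features and resulting size."""
    rows = []
    for image in images:
        features = extract_features(image)  # computed once per image
        for quality in qualities:
            row = dict(features)            # copy so rows do not share state
            row["quality"] = quality
            row["size_bytes"] = len(encode(image, quality))
            rows.append(row)
    return rows


# Toy stand-ins so the sketch runs without any image library.
toy_images = ["a", "bb"]
rows = build_training_rows(
    toy_images,
    extract_features=lambda image: {"length": len(image)},
    encode=lambda image, quality: b"x" * (len(image) * quality),
)
print(len(rows), rows[0])
```

Rows in this shape can be loaded into a pandas DataFrame and passed to a regressor, with `size_bytes` or `quality` as the target depending on which direction the model is meant to predict.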

For this experiment, I’ve chosen to use a Gradient Boosting Regressor, which is well-suited for this type of non-linear regression problem. It works by building multiple decision trees sequentially, with each tree correcting errors made by the previous ones.

Experiments and results

Now let’s evaluate our approach with different image types. The run shown below found 73 JPEG images in the photos directory, and an 80/20 train/test split left 58 images for training and 15 for testing.

Code
# Load image dataset - check multiple extensions
image_files = []
for ext in ['.jpeg', '.jpg', '.png']:
    files = glob.glob(f"photos/*{ext}")
    if files:
        print(f"Found {len(files)} files with extension {ext}")
        image_files.extend(files)
        
if not image_files:
    raise ValueError("No image files found in the photos directory")
    
print(f"Found {len(image_files)} total images")

# Split into training and testing sets
train_files, test_files = train_test_split(image_files, test_size=0.2, random_state=42)

# Train the prediction model
model = train_size_predictor(train_files)
Found 73 files with extension .jpeg
Found 73 total images
Training model on 58 images...
Processing image 0/58
Processing image 10/58
Processing image 20/58
Processing image 30/58
Processing image 40/58
Processing image 50/58
Top 5 features for prediction:
        Feature  Importance
9       quality    0.757299
5       entropy    0.096391
4  edge_density    0.055317
6      variance    0.050526
1        height    0.015152
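
The importance table lists quality among the inputs, which suggests the trained model predicts file size from image features plus a quality value. Under that reading (an assumption on my part, since the body of `train_size_predictor` is not shown), a quality setting can be chosen by scanning the candidate values and keeping the highest one whose predicted size fits. `pick_quality` and `predict_size` are hypothetical names.

```python
def pick_quality(predict_size, features, target_bytes, q_min=1, q_max=95):
    """Return the highest quality whose predicted size fits the target.

    Falls back to q_min when no candidate is predicted to fit. Scanning every
    value avoids assuming the predictions are monotone in quality.
    """
    best = q_min
    for quality in range(q_min, q_max + 1):
        if predict_size(features, quality) <= target_bytes:
            best = quality
    return best


# Toy predictor standing in for a trained model: 1000 bytes plus 50 per step.
def toy_predict_size(features, quality):
    return 1000 + 50 * quality


print(pick_quality(toy_predict_size, features={}, target_bytes=3000))
```

The scan only calls the model, never the encoder, so it is cheap relative to compressing the image. Its result is still just a starting guess that has to be checked against a real encode.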
Code
def evaluate_compression_methods(test_files, model, target_kb=200):
    """Compare binary search and ML-assisted compression with simple metrics"""
    results = []
    
    for i, img_path in enumerate(test_files[:3]):  # Just test 3 images
        img = Image.open(img_path)
        target_bytes = target_kb * 1024
        
        # Binary search method
        bs_start = time.time()
        bs_iterations = [0]  # Use a list to track iterations inside function
        
        def count_bs_iteration(*args, **kwargs):  # Accept keyword args such as end= or file=
            bs_iterations[0] += 1
            return True
            
        with patch('builtins.print', side_effect=count_bs_iteration):  # Silence print and count its calls as a proxy for iterations
            bs_result = compress_to_target_size(img, target_bytes)
        
        bs_time = time.time() - bs_start
        bs_error = abs(len(bs_result) - target_bytes) / target_bytes * 100
        
        # ML method
        ml_start = time.time()
        ml_iterations = [0]
        
        def count_ml_iteration(*args, **kwargs):  # Separate counter for the ML run
            ml_iterations[0] += 1
            return True
            
        with patch('builtins.print', side_effect=count_ml_iteration):  # Silence print and count its calls as a proxy for iterations
            ml_result = smart_compress_to_size(model, img, target_bytes)
            
        ml_time = time.time() - ml_start
        ml_error = abs(len(ml_result) - target_bytes) / target_bytes * 100
        
        # Save results
        results.append({
            'image': os.path.basename(img_path),
            'bs_iterations': bs_iterations[0],
            'ml_iterations': ml_iterations[0],
            'bs_time': bs_time,
            'ml_time': ml_time,
            'bs_error': bs_error,
            'ml_error': ml_error,
            'speedup': bs_time / ml_time
        })
    
    # Print results in a clean table
    print(f"{'Image':<20} {'BS Iter':^8} {'ML Iter':^8} {'BS Time':^8} {'ML Time':^8} {'BS Err%':^8} {'ML Err%':^8} {'Speedup':^8}")
    print("-" * 80)
    
    avg_speedup = 0
    for result in results:
        print(f"{result['image']:<20} {result['bs_iterations']:^8} {result['ml_iterations']:^8} "
              f"{result['bs_time']:.2f}s {result['ml_time']:.2f}s {result['bs_error']:.2f}% "
              f"{result['ml_error']:.2f}% {result['speedup']:.2f}x")
        avg_speedup += result['speedup']
    
    print("-" * 80)
    print(f"Average speedup: {avg_speedup/len(results):.2f}x")
    
    return results
Code
results = evaluate_compression_methods(test_files, model)
Image                BS Iter  ML Iter  BS Time  ML Time  BS Err%  ML Err%  Speedup 
--------------------------------------------------------------------------------
IMG_1281.jpeg           8        6     1.10s 1.80s 0.59% 71.11% 0.61x
IMG_1217.jpeg           7        3     2.12s 2.96s 106.04% 106.04% 0.72x
IMG_1946.jpeg           7        3     1.27s 1.40s 95.53% 95.53% 0.90x
--------------------------------------------------------------------------------
Average speedup: 0.74x

Let’s break down what these numbers tell us:

Iteration Efficiency

On the three test images, the ML-assisted method needed fewer compression attempts than binary search (an average of ~4 iterations vs ~7, as counted by the evaluation code). This is consistent with the hypothesis that ML can guide the search more efficiently, although three images is a small sample. The model appears to be “learning” something about the relationship between image features and compression parameters.
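
One way a prediction can cut the number of attempts is to start the search at the predicted quality and widen outward only as far as needed. The sketch below is a hypothetical illustration of that idea, assuming size does not decrease as quality rises; `seeded_search` and `encode` are names I introduce here, and this is not the post's `smart_compress_to_size`, whose body is not shown.

```python
def seeded_search(encode, target_bytes, guess, q_min=1, q_max=95):
    """Find the highest quality that fits `target_bytes`, starting near `guess`.

    `guess` must be an integer. Returns (quality, data, encodes), or
    (None, None, encodes) when even q_min does not fit.
    """
    cache = {}  # quality -> encoded bytes, so each quality is encoded once

    def fits(quality):
        if quality not in cache:
            cache[quality] = encode(quality)
        return len(cache[quality]) <= target_bytes

    guess = max(q_min, min(q_max, guess))
    lo, hi = q_min - 1, q_max + 1  # sentinels: lo "fits", hi "does not fit"
    step = 1
    if fits(guess):
        lo = guess
        while lo < q_max:              # widen upward with doubling steps
            probe = min(q_max, lo + step)
            if not fits(probe):
                hi = probe
                break
            lo = probe
            step *= 2
    else:
        hi = guess
        while hi > q_min:              # widen downward with doubling steps
            probe = max(q_min, hi - step)
            if fits(probe):
                lo = probe
                break
            hi = probe
            step *= 2

    while hi - lo > 1:                 # ordinary bisection inside the bracket
        mid = (lo + hi) // 2
        if fits(mid):
            lo = mid
        else:
            hi = mid

    if lo < q_min:
        return None, None, len(cache)
    return lo, cache[lo], len(cache)


def toy_encode(quality):
    return b"x" * (1000 + 50 * quality)  # toy stand-in encoder


quality, data, encodes = seeded_search(toy_encode, target_bytes=3000, guess=38)
print(quality, len(data), encodes)
```

A guess close to the answer keeps the bracket small, but a guess far from it can cost more encodes than plain binary search, so the benefit depends entirely on how good the prediction is.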

Time Performance

Contrary to my initial expectations, the ML-assisted approach is slower overall despite requiring fewer iterations. The average speedup is 0.74x, meaning binary search finishes in about 74% of the time, or equivalently that the ML-assisted run takes roughly 35% longer (1/0.74 ≈ 1.35). The likely explanation is that the computational overhead of extracting image features and running the prediction model outweighs the time saved from fewer compression attempts; the evaluation code only measures total time, so this attribution is an inference. It makes sense in retrospect: our feature extraction involves several computationally intensive operations (edge detection, entropy calculation, etc.), and the gradient boosting model prediction isn’t free either. For simple binary search where each step is just a JPEG compression attempt, there’s very little overhead beyond the compression itself.
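
Since the speedup is defined in the evaluation code as `bs_time / ml_time`, a value of 0.74x can be read two ways: binary search needs 26% less time than the ML-assisted run, or the ML-assisted run needs about 35% more time than binary search. A small sketch of the conversion, using the speedup figures from the results table (`extra_time_percent` is a hypothetical helper name):

```python
def extra_time_percent(speedup):
    """How much longer the ML run takes than binary search, in percent.

    speedup = bs_time / ml_time, so ml_time / bs_time = 1 / speedup.
    """
    return (1 / speedup - 1) * 100


# Speedups as printed in the results table, plus the reported average.
reported = {
    "IMG_1281.jpeg": 0.61,
    "IMG_1217.jpeg": 0.72,
    "IMG_1946.jpeg": 0.90,
    "average": 0.74,
}
for name, speedup in reported.items():
    print(f"{name:<15} {speedup:.2f}x -> {extra_time_percent(speedup):.0f}% longer")
```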

Accuracy Comparison

The accuracy results show an interesting pattern:

For two images (IMG_1217 and IMG_1946), both methods achieved identical error rates (albeit high ones at 106% and 95%).

For one image (IMG_1281), the binary search method was dramatically more accurate (0.59% error vs 71.11%).

This suggests that our ML model struggles with certain types of images, failing to generalize well across the entire dataset. This isn’t entirely surprising - compression efficiency is highly content-dependent, and our feature set, while carefully chosen, may not capture all the nuances that affect JPEG compression.
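
It helps to translate the error percentages back into file sizes. The evaluation code computes error as `abs(len(result) - target_bytes) / target_bytes * 100`, so each reported value is consistent with at most two output sizes. The sketch below works these out for the 200KB target; `sizes_matching_error` is a hypothetical helper introduced for illustration.

```python
def percent_error(actual_bytes, target_bytes):
    """Absolute deviation from the target, as a percentage of the target."""
    return abs(actual_bytes - target_bytes) / target_bytes * 100


def sizes_matching_error(error_pct, target_bytes):
    """Return the (larger, smaller) output sizes that would give `error_pct`.

    The smaller candidate is None when the error exceeds 100%, because an
    output below the target can never be off by more than the target itself.
    """
    larger = target_bytes * (1 + error_pct / 100)
    smaller = target_bytes * (1 - error_pct / 100) if error_pct <= 100 else None
    return larger, smaller


target = 200 * 1024  # default target_kb=200 in evaluate_compression_methods
for err in (0.59, 95.53, 106.04):
    larger, smaller = sizes_matching_error(err, target)
    smaller_text = "n/a" if smaller is None else f"{smaller / 1024:.1f}KB"
    print(f"{err:>6.2f}% error -> {larger / 1024:.1f}KB or {smaller_text}")
```

An error of 106.04% can only come from an output of roughly 412KB, more than double the target. That hints the target may have been out of reach for that image at the quality levels tried, rather than the search simply stopping in the wrong place. The 95.53% figure is ambiguous from the error value alone.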

Why Machine Learning Falls Short

Our findings highlight some fundamental challenges in applying ML to this compression problem:

Feature Extraction Overhead: The time spent calculating edge density, entropy, and other features is substantial. While these calculations happen just once per image, they’re expensive enough to negate the benefit of fewer compression attempts.

Model Generalization: The high error rate for some images indicates our model doesn’t generalize perfectly across different image types. This is a common challenge in ML - the relationship between image features and optimal compression settings is complex and may require more sophisticated features or model architectures.

Discrete Output Space: JPEG quality is an integer parameter, which means our regression model is trying to predict a discrete value. This can lead to step-function behavior where small differences in prediction lead to large differences in results.

Non-linear Relationship: The relationship between quality settings and file size is highly non-linear and varies significantly by image content, making it difficult to model accurately.

Conclusion

While our ML approach didn’t deliver the performance improvements I initially hoped for, it revealed important lessons about the trade-offs involved.

The results remind me that sometimes simple algorithms like binary search are hard to beat, especially when the overhead of a more complex approach outweighs its benefits. This doesn’t mean ML has no place in compression - rather, it suggests we need to be thoughtful about where and how we apply it.

For everyday compression needs, the traditional binary search approach remains efficient and reliable. However, I believe there’s still potential for ML to enhance compression in specialized scenarios, particularly where patterns in the data can be leveraged or where quality optimization is as important as reaching a target file size.

Future work in this area might explore more sophisticated feature engineering, domain-specific models, or hybrid approaches that combine the strengths of traditional algorithms with the predictive power of machine learning. As for that annoying “file must be under 2MB” requirement that inspired this project? For now, I’ll stick with binary search to meet those constraints, but I’m excited about the possibilities that continued research in this area might bring.